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Abstract. 

We describe an algorithm for computing an inverse spherical harmonic transform suitable for 
graphic processing units (GPU). We use CUDA and base our implementation on a Fortran90 
routine included in a publicly available parallel package, s 2 hat. We focus our attention on the 
two major sequential steps involved in the transforms computation, retaining the efficient parallel 
framework of the original code. We detail optimization techniques used to enhance the performance 
of the CUDA-based code and contrast them with those implemented in the Fortran90 version. We 
also present performance comparisons of a single CPU plus GPU unit with the s 2 hat code running 
on either a single or 4 processors. In particular we find that use of the latest generation of GPUs, 
such as NVIDIA GF100 (Fermi), can accelerate the spherical harmonic transforms by as much as 18 
times with respect to s 2 hat executed on one core, and by as much as 5.5 with respect to s 2 hat on 
4 cores, with the overall performance being limited by the Fast Fourier transforms. 

The work presented here has been performed in the context of the Cosmic Microwave Background 
simulations and analysis. However, we expect that the developed software will be of more general 
interest and applicability. 

1. Introduction. Spherical harmonic transforms are ubiquitous in diverse areas 
of science and practical applications, which need to deal with data distributed on a 
sphere. In particular, they are heavily used in various areas of cosmology, such as 
studies of the cosmic microwave background (CMB) radiation and its anisotropics, 
which have been our main motivations for this work. CMB is an electromagnetic 
radiation left over after the hot and very dense stage of early evolution of our Universe. 
The CMB measurements allow us to look back directly at the Universe when its age 
was only a small fraction (~ 3%) of its current one (~ 13Gyrs), and indirectly to 
learn about its status as far back as to ~ 10~ 35 sec after its nominal beginning (so 
called Big Bang). Not surprisingly, the CMB measurements play a vital role in the 
present-day cosmology and have been a driving force behind turning it into a high 
precision, data-driven science it is today. 

The CMB radiation is nearly isotropic but minute deviations, on order of 1 part in 
10 5 , were first theoretically predicted and later detected. These so-called anisotropies 
encode the information about the Universe, its past and composition, and their de- 
tection and characterization has the major target of the CMB observations since the 
moment of its discovery in 1965. Over the time progressively more sophisticated and 
advanced observational apparata have been designed and deployed in search for their 
more subtle and taletelling characteristics. These include three major CMB satel- 
lites - American: Cosmic Microwave Background Explorer (COBE)[13], Wilkinson 
Microwave Anisotropy Probe (WMAP) [2], and European Planck 1 - and a few dozen 
of ground-based and balloon-borne projects. Some of these are operating at this time, 
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including WMAP and Planck, and more are planned for the near and medium-term 
future, including potential new satellite missions, considered currently in Europe, US, 
and Japan. 

The CMB detector technology has been improving quickly over the past decade, 
propelling a continuous increase of a number of detectors per experiment at the rate 
reminiscent of the Moore's law. This in turn has been driving a similar increase of the 
CMB data sets sizes, posing a formidable challenge for the CMB data analysis. This 
challenge can be only met if efficient numerical algorithms and the latest computer 
hardware are employed to match the data size increase by a concurrent increase in 
our processing capability. Spherical harmonic transforms are some of the most fun- 
damental tools used in the CMB data processing. This is because the CMB signal 
is naturally a function of the observational direction and thus can be adequately de- 
scribed as a field defined on a sphere. The spherical harmonic functions are a suitable 
basis to represent and manipulate such fields. The spherical harmonic transforms 
involve a decomposition of the signals defined on the sphere into a set of harmonic 
coefficients (i.e., a direct spherical harmonic transform) as well as synthesis of the sky 
signal given a set of harmonic expansion coefficients (i.e, an inverse spherical harmonic 
transform). The latter is for instance a key step in massive Monte Carlo simulations 
used in the CMB data processing. As they usually require a very high resolution and 
precision, synthesis operations, referred hereafter as alm2map transforms, are partic- 
ularly time and resources consuming. In this work we therefore focus specifically on 
alm2map transforms, and discuss their implementation on the NVIDIA GPU architec- 
ture within the CUDA framework. Similar techniques to these described here should 
be sufficient for an efficient implementation in this context of the direct map2alm 
transforms. We leave those for the future work. We note that the spherical har- 
monic transforms are commonly used beyond cosmology, for example, in geophysics, 
oceanography, or planetology and for all of which the implementation described here 
should be directly relevant. 

There are several packages available implementing the spherical harmonic trans- 
forms with healpix 2 , S 2 hat 3 , GLESP 4 , CCSHT 5 , particularly popular in the CMB 
research. Out of these, we have selected S 2 hat (Scalable Spherical Harmonic Trans- 
form) as the starting point for this research and a reference for performance compar- 
isons. While all these packages implement similar numerical algorithms, in particular 
S 2 hat originated from the HEALPIX routines, only S 2 HAT is not strictly tied to any spe- 
cific sky pixelization or discretization schemes. S 2 hat is written in Fortran90, fully 
parallelized using MPI, and shows memory scalability, good speedup and load-balance 
over a wide range of considered problems. 

Our primary final target are however heterogeneous, multi-processor systems 
made of multiple CPUs, each accompanied by a respective GPU. As the first step 
towards achieving this goal we focus on porting the two main, serial steps in the 
calculation of the transforms onto GPUs and retain the data distribution layout and 
communication structure of the original MPI code. Consequently, when run on a 
multi-processor/multi-GPU platform our code employs MPI calls to distribute the 
data and workload over all the CPUs, which then send them to their respective GPUs, 
where the bulk of the computation is performed. The MPI communication pattern 
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Fig. 1.1: An example output of the CUDA alm2map routine the implementation of 
which is described in this paper. The figure shows a simulated picture of the sky as 
seen in the microwave band of electromagnetic radiation. The units are arbitrary. 



and work distribution inherited from the S 2 HAT package have been both demonstrated 
to scale well in terms of memory load and execution time, and therefore we anticipate 
that the overall performance gain due to the speed of the GPUs will translate directly 
to an analogous total speed-up of the complete code. The proper end-to-end perfor- 
mance evaluation of the code on heterogeneous systems is the object of our on-going 
work. The performance tests presented in this paper focus specifically on the benefits 
due to GPUs and thus consider only single CPU/GPU cases. In addition this paper 
provides a first detailed description of the MPI parallelization of the efficient, scalable 
spherical harmonic transforms as implemented in S 2 HAT. Fig. 1.1 provides an ex- 
ample result produced with our CUDA code. (An interested reader can compare this 
figure with the actual observations produced by the WMAP satellite and published 
in [2]). 

The paper is organized as follows. In section 2 we introduce the spherical harmonic 
transforms and outline their MPI-based implementation. After a brief introduction 
to the CUDA programming model in section 3, we present a detailed description of 
our modified algorithm suitable for GPUs in section 4 and associated optimizations 
in section 5. Section 6 present some comparison results of both implementations. 
Section 7 concludes the paper. 

2. Spherical harmonic transforms. 

2.1. Algebraic background. For a real, scalar field, s, defined on the <S 2 -sphere 
the pair of spherical harmonic transforms is defined as follows, 

(2.1) aim = J]] s (9 p ,(f>p) Yim {9 P ,4> V ) , 

{o p ,4> P } 
C t 

(2.2) s{e p ,$ p )= 

dim Yim {Op, 4>p) ■ 

1=0 m=-l 

Here the coefficients a^ m define a harmonic representation of the field s, Yt m stands 
for a spherical harmonic, and {9, (j)) denote standard spherical coordinates. As in all 
practical application the field is pixelized or sampled on a discrete set of points, we 
have replaced the continuous integral in Eq. 2.1 by a discretized summation, which 
now goes over all the pixels on the sky. The upper limit, £ max in Eq. 2.2 defines the 
band-limit of the field s and is considered to be finite. In the CMB application it 
is usually determined by an experiment resolution and its typical values are £ max — 
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O (10 3 - 10 4 ). Hereafter, we will focus on the second of these two transforms and 
refer to it as the alm2map transform. Its objective is to reconstruct, or synthesize, the 
field, s, from its harmonic coefficients ag m on a grid of points p. (Hereafter, we will 
drop the index p for shortness.) 

The spherical harmonics are defined as, 

'- Vi m (cos 9) e"™'* 



(2.3) 



Ylm (9, 



where renormalized associated Legendre functions, Vg m (cos 9) are solutions of the 
Hemholtz equations, e.g., [1], and their normalization is selected to ensure that Yg m 
constitute an orthonormal basis on the sphere. 

S (9,<t>) = Y^ Yl atm Ytm ^ 

1=0 m = ~l 

(ft.4^ aeoVeo (cos<9) + ^ e lm< * ^ (cos 9) + e _im0 °L?Wcos0), 

g — Q m— 1 H — m m—1 H—m 

where we use the fact that, 

(2.5) V lm (costf) = (-l) m V l( - m) (cos^) 

(2.6) 



aim = (— l) a 



t(-m)- 



The latter explicitly assumes that the map, s, is real and a dagger denotes a complex 
conjugation. We can introduce now a set of functions, A m (0), such as, 



(2.7) 



A m (0) EE 4 



a eo Vita (cos 6») , m = 0; 

^max 

a^ m P <m (cos 6») , m > 0; 



and rewrite Eq. (2.4) as, 
(2.8) 



^ <4| m | ^|m| (cos6>) , m < 0, 

*=|m| 



a (0, 0) = ]T e"^ A m (0) . 



The associated Legendre functions can be computed via a 2-point recurrence, e.g., 
[I], with respect to the multipole number, I. It reads, 



(2.9) 
where 
(2.10) 



Pi + 2,m (x) = Pe + 2,; 



Pi, 



xVt+i, m (x) + 



je+i,m 



-V lm (x) 



4P - l 



I 2 — rn? 

The recurrence is initialized by the starting values 

l 



(2m + l)! 

47T 



(2.ll) 
(2.12) 



2 m m! 

= Mm (l - X ) , 

Vm + l,m (x) = Pl+l, m X Vmm {x) . 
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(l-X ) 



The recurrence is numerically stable but requires double precision and a care has to 
be taken to ensure it does not under- or overflows. We describe a relevant algorithm 
in the next Section. Eqs. 2.7 and 2.8 provide a basis for the numerical implementation 
of the spherical harmonic transforms. 

2.2. Current Approach. A detailed description of the efficient serial imple- 
mentation of the transforms can be found elsewhere [6]. Here we briefly outline the 
most important features, highlighting in particular the parallel aspects. 

2.2.1. Numerical complexity. From the sphere sampling considerations [4], 
we know that to properly sample a band-limited function with the band-limit set to 
(-max we need roughly n pix ~ ( max points on the sphere. Therefore to perform the 
operations required to calculate A (9), and as detailed in Eqs. 2.7, we need as many 
as O (n pix ) floating point operations (FLOPs). This is because for each of n pix pixels 
we have to do the Vi m recurrence for all i and m numbers, and there are O (t max ) ~ 
O (npi x ) of (£, m) pairs for a properly sample field. This is clearly a prohibitive scaling. 
It can however become more favorable if the problem is restricted to some specific 
sky pixelization/discretization schemes [4]. In particular, in the following we will 
always assume that all pixels/sky samples are arranged in a number of so-called iso- 
latitudinal rings, each of which have the same value of the polar angle, 9. Typically 
there will be n r i ngs <~ t m ax rings with each ring uniformly sampled <~ i ma x times. 
Moreover, we will assume that the sky is pixelized symmetrically with respect to the 
equator. Such schemes indeed have been proposed and demonstrated to work well 
in practice [4, 7, 6, 3] in a number of applications. With these constraints imposed 
on the pixelization the scaling for Eq. 2.7 is now O(n^), given that the full Vi m 
recurrence needs to be now done only ones for each of the rings. The numerical cost 
of the final summation, Eq. 2.8, is then sub-dominant as it can be implemented using 
Fast Fourier transform (FFT) techniques, at the total cost of O (n P i X lnn^) FLOPs. 

We note here in passing that for this class of pixelizations even faster algorithms 
have been proposed with the complexity either on order of 0[n pix (\nn pix ) 2 } [4] or 
O (n P i X \nn P i X ) [14]. However, they have a significant prefactor, involve complex al- 
gorithmic solutions, and have not been demonstrated to be numerically viable for 
imax > 100. 

2.2.2. Algorithm. The implementation of Eqs. 2.7 and 2.8 is rather straightfor- 
ward. The pseudo code is outlined as Algorithm 1. Two steps which require somewhat 



Algorithm 1 Basic alm2map ALGORITHM 

STEP 1 - A m CALCULATION 

comment: Algorithm 2 has to be embedded below, 
for every ring r do 

for every m = 0, m max do 
for every I = m, ...,C do 

- compute Vim via, the 2-point recurrence, Eq. 2.9; 

- update A m (r), given input ae m and computed Vim, Eq. 2.7; 
end for (I) 

end for (m) 

STEP 2 - S CALCULATION 

- calculate s via FFT, given A m (r) pre-calculated for all m; 
end for (r ) 
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more attention are the recurrence and the FFT. The two point recurrence as the one 
in Eq. 2.9 spans a huge dynamic range of values. This range depends on the values of 
£, which need to be considered, but already for values as low as O (10 2 ) it exceeds that 
accorded to a double precision number on a typical processor. As we have freedom 
to rescale all the values of Vi m , we can try to make use of it to rescale the recur- 
rence starting values, e.g., V mm and V m +i, m , appropriately to avoid the overflows 
later. However, this simple fix works only as long as the rescaled values of V mm and 
r P m +i,m do not cause underflows. Though this on its own is not an issue, as these 
two will be typically set to zero what is clearly a good approximation to their values, 
our 2-point recurrence as a result will never produce any other outcome than zero. 
This will apply also to those Legendre functions, which initially produced the overflow 
and were supposed to be brought to within the representable range of values via the 
rescaling. A more robust solution to the problem is that of real-time rescaling. In the 
scheme usually used for this the newly computed values are tested if they approach 
over- or underflow limits and whenever this is the case they are rescaled appropriately. 
The rescaling coefficients (e.g., in form of their logarithms) are kept tracked of and 
used to rescale the computed values of Vt m at the end as required. This scheme is 
based on two facts. First, that the values of the Legendre functions calculated via the 
recurrence change gradually and rather slowly on each step. Second, that their final 
values are well-within the range of the double precision values. 

The specific implementation of these ideas used in the S 2 HAT software, and derived 
from the solution coded in the HEALPix package, uses a precomputed vector of values, 
sampling the dynamic range of the representable double precision numbers and thus 
avoids any explicit computation of numerically-expensive logarithms and exponentials. 
The scaling vector, referred to hereafter as a rescale table is used to compare the 
values of V\ m computed on each step of the recurrence, and then used to rescale them 
if needed. 

The respective pseudo-code for the Legendre function recurrence is presented as 
Algorithm 2. The associated Legendre function recurrence is normally performed on- 



Algorithm 2 2-point associated Legendre recurrence 

- initialize the rescaling table; 

- precompute fi coefficients, Eq. 2.11; 
for every ring r do 

for every m = 0, m max do 

- initialize the recurrence: Vmm, V m +i,m, Eqs. 2.11 & 2.12, using precomputed fj, m ; 
-precompute recurrence coefficients, /3t m (fixed m, £ £ [m,£ max ]), Eq. 2.10; 
for every I — m + 2, ...,£ max do 

- compute Ve, m given Vt-i, m an d Vt-i m , given precomputed f3e m , Eq. 2.9; 

- test the value of Ve+2m against the rescaling table; 

- rescale Ve+2,m and Ve+i, m "° needed, keep the info about the rescaling coefficient; 
comment: Vim needs to be scaled back before being used in the calculations of 
the functions, A m ; 

end for (£) 
end for (m) 
end for (r ) 



the-fly and Algorithm 2 is thus merged with the algorithm for the alm2map transform, 
Algorithm 1. 

The application of the FFTs in the last step of Algorithm 1 also requires some 
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care. This is because the number of samples per ring may be either larger or smaller 
than the number of available m-modes. The former case can be dealt with by assuming 
the missing mode amplitudes to be zero. In the latter case the extra m modes have 
to be wrapped up and co-added to the modes present in the box. We refer the reader 
to paper [6] for more details of the involved calculations. 

2.2.3. MPI parallelism. The structure of the parallel implementation of Al- 
gorithm 1 is determined by the data layout. For a properly sampled full sphere the 
input set of the harmonic coefficients, a^ m , and the output sky map, s, are roughly of 
the same size as ~ n pix ~ £ max - These two objects are typically large and preferably 
have to be distributed. 

S 2 HAT divides the 2-dimensional a em array by assigning to a processor i a subset 
of all coefficients with predefined m values, Mi- The 2-dimensional sky map s is 
treated as a collection of rings, r, so a subset, IZi of complete rings is distributed to a 
processor i. Recall that r corresponds to a subset of pixels, identified by a unique 9. 
This ensures the memory scalability of the algorithm if only the sizes of all the sets, 
Mi and TZi, decrease as roughly ~ l/n procs . 

The algorithm, see Algorithm 3, proceeds then in two steps. First, given the in- 
put subset of all {ae m ,m e Mi} coefficients, each processor calculates, using Eq. 2.7, 
the A TO functions for every ring, r, of the sphere and m G Mi- Once this is done, 
the global communication is performed so that at the end each processor has in its 
memory A m functions calculated for the subset of rings as assigned to this processor, 
IZi, and all m values. Given these data each processor performs the second step of 
the sky calculations, Eq. 2.8, using FFTs. We note that once the data distribution is 
performed then steps 1 and 2 of the algorithm are embarrassingly parallel. The mem- 
ory required to store the intermediate products, A m , is on order of O (n p i x /n procs ) 
and therefore are comparable to that used to store the input and output objects in 
their distributed form. Also like the latter they decrease as a number of employed 
processors increases, preserving the overall memory-scalability of the algorithm. In 
its current form the S 2 HAT implementation is memory-distributed and implemented 
using MPI. Adding the openMP layer could be certainly useful as it could help to al- 
leviate the communication bottleneck and thus facilitate running the library at even 
higher concurrencies. Nevertheless openMP is expected to have no major impact on 
the computational efficiency of steps 1 and 2 of the algorithm due to their embarrass- 
ingly parallel character. Considering the accelerators, such as GPUs, is therefore a 
potentially attractive avenue to explore. 

3. NVIDIA CUDA Programming Model. NVIDIA CUDA is a general pur- 
pose parallel computing architecture - with a new parallel programming model and 
instruction set architecture - that takes advantage of the parallel compute engine in 
NVIDIA GPUs. CUDA comes with a software environment that allows developers 
to use C as a high-level programming language. Other languages and application 
programming interfaces are also supported. 

A CUDA-enabled GPU is basically a manycore chip consisting of hundreds of 
simple cores, called Streaming Processors (SP), together with control logic for the 
different levels of encapsulation. Each SP executes instructions sequentially, has a 
pipeline, 2 arithmetical-logical units and one floating point unit. The older generations 
do not have a cache. Several SPs are encapsulated in a Streaming Multiprocessor 
(SM). Multiple SMs form a Texture/Processor Cluster (TPC). All the TPCs form the 
Streaming Processor Array. 
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Algorithm 3 S 2 hat alm2map ALGORITHM - MPI implementation 



comment: Code executed by each MPI process. 

STEP 1 - A m CALCULATION 

- step 1.1- initialize the rescaling table; 
step 1.2 - precompute fi coefficients, Eq. 2.11; 
for every ring r do 

for every m G Mi: do 

- step 1.3 - initialize the T6CUTT6HC6'. l^mmy ^m+l,mj EQS. 2.11 2.12, USlIlg pTGCOlYl- 

puted fi m ; 

- step 1.4 - precompute recurrence coefficients, j3t m (fixed m, I £ [m, i m ax\), Eq. 2.10; 
for every I = m + 2, ...,£ ma x do 

- step 1.5 - compute Vi m via the 2-point recurrence, given precomputed f3i m , 
Eq. 2.9; 

- step 1.6- test the value of Vim against the rescaling table; 

- step 1.7 - update A m (r), given input a^ m and computed Tt m , Eq. 2.7; 
end for (£) 

end for (m) 
end for (r) 

GLOBAL COMMUNICATION 

- redistribute {A m (r) , m G Mi, all r} MPI - A i^° a111 ' {A m (r) ,r6R„ all m} 

STEP 2 - S CALCULATION 

for every ring r £ IZi do 

- using FFT calculate s (r, <f>) for all samples <f> € r, given pre-computed A m (r) for all 
m. 

end for (r) 



Inside one SM, the SPs execute in a SIMD fashion, while different SMs may 
execute different parts of the instruction stream in a SPMD fashion. Each SM also 
manages hundreds of active threads in a cyclic pipeline in order to hide memory 
latency. In practice, 32 threads are grouped together and scheduled as a Warp, 
executing the same instructions. 

The latest two architecture generations available from NVIDIA are GT200 and 
GF100 (Fermi). The best model of GT200 has a total of 240 SPs (8 SP/SM x 3 
SM/TPC x 10 TPC), with 16 KB of shared memory per SM and 16K available 
registers per block. It has a better implementation for memory fetching, reducing 
the earlier generation (G80) bottleneck produced by uncoalesced read/writes. Its 
theoretical peak performance is 1062.72 GFLOPS, in single precision. For double 
precision, FLOP count is 8 times lower. GF100 increases the number of SPs to 512 
(32 SP/SM x 4 SM/GPC x 4 GPC), shared memory to 64 KB and adds a Level 
1 caching mechanism. Theoretical peak performance for single precision is 1344.96 
GFLOPS and double precision is half this value. However, for the commercial GTX 
480 graphics card, the double precision performance is intentionally limited to one 
eighth of single precision. The only products using the entire DP capacity of the chip 
are those in the NVIDIA Tesla line. 

The CUDA threads are grouped together into a series of thread blocks. All 
threads in a thread block execute on the same SM, and can synchronize and exchange 
data using shared memory. Synchronization between thread blocks is not possible. 
One more level of encapsulation is possible as the thread blocks can be grouped in 
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independent grids. The Fermi chip is equipped with a resizable Level 1 cache (using 
shared memory for storage) which has the purpose of removing manual data copy 
between the slow device global memory and the fast shared memory. It is enabled by 
default, but its size can be switched between 16 KB to 48 KB. 

Various limitations and roadblocks have to be taken into account while developing 
algorithms on such architectures. First, the performance for the total chain of com- 
putation has to take into account the transfer time between the CPU and GPU main 
memory. For smaller algorithms, this transfer can represent more than ten times the 
computation step itself. One is then tempted to fit large segments of data in the GPU 
memory to limit the number of times the data has to be sent and received. However, 
doing so will put a heavy constraint over the memory management code inside the 
kernel. To fasten access to the global memory, data is usually fetched into some local 
memory segments. However, using large segments of local memory reduces the size 
of the registers bank, thus leading to a large amount of slower code spill. Balancing 
between transfer time, size and actual memory management strategy is a very impor- 
tant problem to solve for any high performance algorithm executing on GPUs. We 
address these issues in the following section in the context of our application. 

We also considered OpenCL as alternative for the development, but opted for 
CUDA, which we expect is best suited to exploit the capabilities of the studied hard- 
ware. 

4. alm2map with CUDA. Programming philosophy for CUDA dictates using 
fine grained parallelism and launching a very large number of threads in order to 
use all the available cores and hide memory latency. Since the loop computing the 
two-point recurrence is serial in nature and therefore cannot be broken into parallel 
segments, there are two remaining choices for parallelization: the m-loop and the ring 
loop. 

The initial CPU approach involved parallelizing only the m-loop, by having each 
process compute all the ring values for a subset of m values. This method of paral- 
lelization makes it easy to write code for MPI, as each process works on a subset of 
m values. 

This approach is not appropriate for the GPU however, because of shared memory 
limitations. The size of vector f3i m , Eq. 2.9, depends on i ma x and therefore cannot be 
stored in shared memory. Its values need to be recomputed for each m and are accessed 
sequentially in the ^-loop. A possible implementation would require re-computing the 
film coefficients for each pass through the i?-loop. However, these expensive, repeating 
calculations would seriously limit performance. 

Parallelizing the ring loop (step 1.1 in algorithm 4) avoids this problem and 
has additional advantages. Each thread is assigned a number of rings for which it 
computes the 2-point recurrence for all to- values. The consequence is that each thread 
processes a^ m values at the same to and I coordinates, in parallel. This makes it easy 
to plan the computation of (3i m and fi m , Eq. 2.11, in segments, as well as caching 
the a,g m values. An important added benefit is reusing these two vectors, by sharing 
them inside a thread block. Algorithm 4 shows the outline of the GPU computing 
kernel. It is observable that the three new steps (1.2, 1.3 and 1.4) are designed to work 
around the high latency device memory and take advantage of the fast, but small, 
shared memory. Steps 1.2 and 1.3 calculate the values of the /x m and (3i m vectors 
in segments, as they do not fit in shared memory and it would be slow and wasteful 
to store them in global memory. Step 1.4 tries to keep a supply of ai m values for 
the 2-point recurrence, therefore allowing a more continuous operation of the floating 
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Algorithm 4 S 2 HAT alm2map ALGORITHM - CUDA IMPLEMENTATION 



STEP 1 - A m CALCULATION 

- step 1.1- assign rings for each thread 
for every r e TZj do 

for every m e Mi do 

- step 1.2 - thread in block computes a segment of fj, m ; 
for every I — m + 2, ...,£ ma x do 

- step 1.3 - use precomputed or, if needed, precompute in parallel a segment of 
f3 lm , Eq. 2.10; 

- step 1.4 - use fetched or, if needed, fetch in parallel a segment of aim map data; 

- step 1.5 - compute Vim via the 2-point recurrence, Eq. 2.9; 

- step 1.6 - handle overflow /underflow using rescaling table; 

- step 1.7 - update A m (r), given prefetched aim and computed Ve m , Eq. 2.7; 
end for (£) 

end for (m) 
end for (r) 

GLOBAL COMMUNICATION 

- redistribute {A m (r) , me Mi, all r} MPI - A i^ oallv {A m (r) , re Hi, all m} 

STEP 2 

- using FFT calculate s (9,<f)) f° r a ^ samples for every, r elZi, given pre-computed A m (8) 
for r elZi and all m. 



point units by decreasing memory wait time. Step 1.1 is where the threads select the 
rings on which to work upon. Since the m-loop and ring-loops are interchangeable, 
unlike the CPU version, the ring loop is first, allowing the sharing of the fj, m and (3i m 
vectors. Figure 4.1 displays the acces by each thread of the 2-dimensional arrays a^ m 
and (3e rn . 



M mmax-1 



(a) Access pattern for input data ( (b) Output matrix (A m ) write 
aim ) pattern 



Fig. 4.1: Input and output data access patterns 



5. Optimizations for GPU. GPU code optimization uses different rules than 
regular, CPU based code optimization. Due to the massively parallel structure of the 
target architecture, GPU code needs to fulfill a different set of constraints. Among 
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these, the relationship between the cost of memory access and amount of computa- 
tions per kernel is exacerbated. As shown in [16] for example, it can be far more 
beneficial to recompute large segments of constant values instead of fetching them 
from main memory. Others [8] show that, in some cases, the most direct algorithm 
can outperform the CPU optimized one. Another source of performance loss is thread 
divergence due to asymmetrical branching in control flow. Such divergence is usually 
detected by various profilers, but can prove hard to remove. 

Based on guidelines for CUDA kernel optimization [11, 10, 12] and previous expe- 
riences, we mainly looked at limiting the effect of the slow global memory by buffering, 
precalculating or reusing data, removing branching in performance-critical sections 
and canceling warp serialization. 

Array segmentation. Due to shared memory small size, it is required to com- 
pute the f3e m vector in segments, on the fly (step 1.3). Pre-calculating it entirely in 
device memory (akin to the original CPU implementation) would be very slow, as 
completing one Pi m value requires reading the entire vector. (3g m segments are com- 
puted inside the ^-loop. Since f3i m is accessed sequentially, a portion of the vector is 
computed then used in the following steps of the recurrence. When existing values 
are exhausted, the next portion is computed. The size of the segment influences code 
performance, as it can be seen in the performance section. The same philosophy is 
applied for the fi m vector. Only difference is the segments are computed inside the 
to- loop (step 1.2). The advantage of having the code process the same at m data is 
that the two vectors are computed only once (in a parallel and serial fashion, respec- 
tively) and then reused by all threads in a thread block. As expected, the runtime 
decreases with increase in the number of threads. 

A similar approach to segmentation is employed for offsetting memory latency for 
reading the a^ m coefficients and transferring them only once before being used by all 
threads in a block (step 1.4). ag m values are transferred in segments during the Plm 
computation in step 1.5. Optimal segment size for all three vectors is found through 
testing. It is input size and platform dependent. Because of this, an autotuner is 
the best solution for obtaining the best possible performance. By running it for an 
input of the size targeted for computation, with all representative segment size and 
thread count variations, the best set of segment sizes can be selected. Currently, 
combinations that provide acceptable performance in all cases have been found by 
manual inspection and are being used as defaults in the code. 

The nature of /3^ m and a tm allows their values to be obtained in parallel, by 
computing or fetching (steps 1.3 and 1.4, respectively). The number of threads which 
perform this operation is directly linked to segment size. In particular, the segment 
size must be a multiple of the number of threads. This avoids additional code for 
handling outlier indexes in performance-critical sections. Keeping in line with the 
CUDA guidelines on shared memory access for avoiding bank conflicts, the threads 
in a block calculate values sequentially, with a stride of block size. Due to its serial 
nature, fi m is computed by a single thread, while others wait for its completion (step 
1.2). 

There is one more type of global information, called pixelization data. It also 
comes in the form of two arrays, but they are not cached. This is because they are 
rarely accessed and caching would complicate the code with no speed benefits. 

Branch collapsing. Code branching can severely impair the performance of 
GPU code, as divergent code is executed sequentially, effectively canceling parallelism. 
For example, when an "if" statement is encountered, some threads execute the true 

11 



branch, while the other wait, then execute the false branch with the first group wait- 
ing. This problem is solved by collapsing the branch into code that has the same 
outcome as before, but can be executed in parallel by all threads. The computational 
overhead is smaller than that incurred by process-and-wait execution. Conditional as- 
signments like if (c) v=tv else v=fv are converted to v=c I tv & !c I fv. The 
use of binary operators makes this expression very fast to compute. It is a common 
technique when converting code to SIMD operations. On the GPU however, this can 
be applied only on operations with integer numbers, as binary operators are not ap- 
plicable to floating point operands. An equivalent version is based on multiplications 
and subtraction: v = tv*c + fv*(l-c). This severely increases the overhead and 
makes it applicable only in some cases. For S 2 HAT code, this version was employed in 
both full and short form (if-then) resulting in decreased branching but with limited 
influence on execution time. 

Other approaches have been tried for using the resources of the GPU as much 
as possible. While none of them provide increased performance, they do offer some 
insight into the operation of this new platform and serve as lessons for the future. We 
describe them briefly in the following. 

Warp serialization. Warp serialization for arrays of double precision floating 
point stored in shared memory is a problem for GT200 chips. Since a memory bank 
holds only 32 bit values, a 64 bit value is stored in two different banks. When the 
number of threads grows beyond half the number of banks, values are accessed con- 
currently from the same bank. However, a bank can only service one request at a 
time and threads making additional requests are serialized, waiting for the previous 
request to complete. NVIDIA Programming guide suggests one possible solution as 
splitting the 64 bit value into 2 32-bit ones, storing them in two different vectors and 
then rejoining at use [11, p. 156]. However, after testing this technique on each vector 
of double precision data stored in shared memory, it proved useless. On the GT200 
architecture, the computational cost of splitting and joining the values outweighed 
that of warp serialization. Moreover, the newer GT400 chips addressed this problem 
and 64 bit floats no longer cause warp serialization. 

Dedicated scaling table. The scaling table is subject to a different kind of warp 
serialization. When threads in a block enter the rescaling phase, they access the data 
inside the array in a random fashion (some elements accessed by a single thread, others 
by multiple). Being small in size (21 64-bit values), the simplest approach for canceling 
serialization is having a copy of each table for each thread. However, experiments 
showed that while serialization does not occur, the time gain is insignificant even for 
small inputs. Also, as the number of threads increases, the amount of shared memory 
used becomes a limiting factor (for just 64 threads, 10.5 KB are needed). 

f3e m precalculation. Based on the ability to execute a very large number of 
mathematical operations and the drawback of high device memory latency, a method 
for obtaining a good throughput is computing values on-the-fly instead of precalcu- 
lating them. This trades computing cycles for memory cycles and some algorithms 
gained significant performance in this manner. /3^ m calculation inside the £-loop 
turned out to greatly increase computation time over both precalculation-based ver- 
sion and segment-based version. This is due to the high number of expensive op- 
erations involved in computing a single value of (3i m (multiplications, divisions and 
square roots), making reuse essential. Computing the scaling factors at usage-time 
had the same problem of expensive operations (powers, in particular). 

Branch collapsing for scaling. Each iteration of the inner, i?-loop involves 
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checking the value to be inside a validity interval and apply a scaling factor if this 
is not the case. This requires two "if" checks and can result in branching, impairing 
performance, as threads can go on 3 different execution paths. By collapsing the code 
using the technique described above, branching was reduced, but had the adverse effect 
of increased execution time. This is caused by the high number of multiplications 
forced on each thread by both the scaling code (which does not always execute) and 
operations introduced by the collapse method itself, since bitwise operators are not 
available for floating point. 

6. Experimental results. Two platforms have been used for testing the code: 
GTX 260 for NVIDIA GT200 architecture and GTX 480 for the new NVIDIA GF100 
(Fermi). Their host systems are: AMD Phenom 9850 (4 cores) with 8 GB of PC3200 
DDR2 memory running on a MSI MS-7376 motherboard and Intel Core i7-960 CPU 
(4 cores, 8 processes with Hyperthreading) with 8 GB of PC3200 DDR2 memory 
running on a Gigabyte EX58-UD5, respectively. 

The number of theoretical double precision FLOPS is significantly in favor of 
the GPUs, with a ratio of 1:2.2 and 1:3.2, respectively, when compared to the 51.2 
GFLOPS double precision performance of Intel i7-960. The GPU FLOPS counts a 
FMADD operation as two separate ones, for an easier comparison with the CPU. It 
is also taken into account the fact that the Fermi chip can process a FMADD and 
ADD operation in parallel. 

Also, the GPU architecture's massive parallelism of 260 and 480 SPs indicates a 
large advantage over the 4 physical cores of the referece CPUs. Even though the algo- 
rithm is known for near-linear scaling, due to the memory bound nature of the code 
(obtaining 10-15% of the theoretical peak performance), the algorithm was expected 
to get a significant, but limited, runtime improvement. It was anticipated that the 
high latency of the GPU global memory would further stall execution. 

On the CPU, the Fortran algorithm was used as reference. Its efficiency was 
computed using the FLOP count returned by the PAPI package. 

For the GPU, the execution time is calculated using the gettimeof day () library 
call between kernel launch and result retrieval. Because the consumer-grade cards 
used have limited memory, the largest dataset used is 4096x4096 and 5120x5120, re- 
spectively. In order to evaluate the possible runtime improvement for larger datasets, 
the output arrays were no longer allocated, allowing the input to fill the entire card 
memory. Results were written in a very small buffer (one value per thread) , in order 
to maintain memory access and not distort the results. In this manner, the dataset 
limit was pushed to 9216x9216. 

6.1. Parameter setup for A m . Algorithm performance is dependent on the 
following parameters: number of threads in a block, number of rings per thread, size 
of the buffers used for computing the fj, m and (3e m values, size of the input buffer, 
usage of 64-bit floating point split into 2 32 bit integers for storage in shared memory, 
usage of LI cache (for Fermi) and the usage of a shared or dedicated rescaling table 
for each block. We tested each of these parameters for their influence on runtime and 
selected the best results. 

Since computing the values for each ring can be done independently, each thread 
can calculate the values for any given number of rings. This would allow running 
a fixed number of threads, for a theoretical performance improvement (for example, 
running a number of threads equal to that of the SPs, resulting in no overhead from 
thread context switching - [16]). However, testing has shown that the optimum 
amount of rings for a thread is 1 . Higher numbers result in significant performance 
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Fig. 6.1: Histograms showing performance decrease (in percentages) due to Ll cache 
enabling on GTX 480 



degradation for any combination of parameters. Since other values would always 
return sub-par (and therefore useless) runtimes, all subsequent tests are performed 
with just one ring per thread. 

One of the difficulties raised by the GT200 architecture is the method used for 
storing 64 bit values in the shared memory banks. When the number of threads 
grows beyond half the number of banks, values are accessed concurrently from the 
same bank, triggering serialization which results in latency for data read/write. The 
solution for avoiding it employs a workaround method, suggested in the NVIDIA 
Programming Guide, by splitting the data into two 32 bit values, storing them in 
shared memory then merging back to the original form at retrieval. This is no longer 
necessary on the Fermi, as 64 bit array access no longer generates warp serialization. 
This technique was tested on the fi mi /3e m and ag m arrays, in all their possible com- 
binations (with or without using the Ll cache) as well as with all thread counts. It 
resulted in overall performance degradation for all cases. The reason for this is that 
the overhead for splitting and merging the values is greater than the time gained 
by avoiding serialization. The following tests were performed with 32-bit splitting 
disabled in all cases. 

Another aspect tested was the cache system of the Fermi chip. This is designed to 
handle automatic fetching of values from device memory and store them into the fast 
shared memory, a task usually performed by hand by the CUD A programmer. Since 
array data is accessed sequentially (see figure 4.1a), caching should be straightforward 
and benefit from a hardware implementation. However, after running tests with all 
combinations of segment sizes, thread count per block and Ll cache on or off, it 
was determined that, for this particular algorithm, the caching implemented by the 
GPU never improves and often degrades performance. This is visible in figure 6.1, 
which contains the histograms for performance degradation values when enabling Ll 
preference, for the different thread counts used. This decrease is most apparent for 
low thread counts (average speed degradation of 200%). For 64 threads per block, 
activating Ll-preference has no impact in half of the tests and ranges between 50% 
and 200% runtime increase for the rest. At 128 threads per block, half of the tests are 
not influenced, whilst the other suffer a 50-60% speed decrease. The input size used 
is large, 4096x4096, in order to obtain a relevant result. We can conclude that, for 
this particular case, manual buffering outperforms the Ll cache system of the Fermi. 

The rescaling table is the read only array most accessed by all threads. Each 
time an overflow or underflow event takes place inside the L-loop, a value in the 
table is read (algorithm 4 - step 1.6). Therefore, optimizing its access can have a 
significant impact on overall algorithm performance. Due to its small size (21 64 bit 
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values) , giving a dedicated copy of the table to each thread is a good way of avoiding 
warp serialization triggered by concurrent random access. However, as the number of 
threads increases, shared memory becomes a limiting factor. This is especially true 
on the GTX 280, which has just 16 KB available. Running the algorithm with the 
rescaling table shared by the thread block has shown that, contrary to expectation, 
the algorithm is faster by 30-50%. 

Subsequent experiments used a shared rescaling table, had LI cache preference 
(when ran on the Fermi) and 32-bit splitting disabled as well as treated just one 
ring per thread. The last three parameters to test are the lengths of the fi m , /3e m 
and ag m buffers. As expected, they have a major impact on the overall algorithm 
performance. Final performance values have been obtained by running all segment 
length combinations (16, 64, 128, 256 and 512 elements) with thread counts per block 
(16, 32, 64, 128, 256 and 512) for all input sizes. Since, for most cases, m max equals 
(max (the ai m matrix is square), the data sets used as input use the same restriction. 

Having such a large set of experimental data, analysis on the influence of each 
parameter was attempted, since it was not freely observable. It was revealed that 
segment length and thread count have a direct influence on the runtime, but a corre- 
lation between their association and runtime was not found. In addition, combinations 
that give the best results for a certain input size do not keep that property for other 
datasets. Also, it was discovered that the runtimes can be split into clusters of values, 
as defined by different parameter combinations. However, this grouping differs with 
the input size. 

In order to obtain the best performance for any input size, an autotuner is the best 
solution. Combinations providing near-best runtimes have been found by manually 
taking the parameters that provide the best time for a certain dataset and analyzing 
its performance when applied for the other datasets. Out of these parameter sets, 
the best overall was selected and used as default. As sufficient experimental data is 
available, this process can be further refined. 

By executing the algorithm on the massively parallel chip that is the GPU, the 
number of running threads and their configuration relative to the processing units 
influences runtime considerably. In order to produce the final performance results, 
the best time is selected when varying both the thread count in a block as well as the 
segment length. 

Figures 6.2a and 6.2b show the runtimes (in seconds) for different numbers of 
threads in a block. The entire range of inputs is considered, including those that fit 
into memory only with output allocation disabled. The best (lowest) times are marked 
by a box. We observe that, for both cards, the best runtime is obtained generally 
with 64 threads per block. In the cases where this is not the case (usually for 128 
threads), the difference is almost negligible. 

The two chips powering the GTX 260 and GTX 480, have a related, but signifi- 
cantly different architecture. However, as observed from the figures, the configurations 
that best use their capabilities are similar, using 64 or 128 threads per block. Unlike 
usual CPU logic, running more threads than execution units is beneficial. This is due 
to the high latency in accessing the device global memory. By using many threads, 
the Block Scheduler can replace blocks waiting for memory fetching with those ready 
for execution. The result is a higher throughput due to high reuse of idle threads. 

6.2. Performance of A m computation . In this section we discuss the per- 
formance of the code on the two GPU platforms (from the latest two generations), 
with respect to the CPU implementation running on two different processors. The 

15 




— 3K 
-»-2K 



Threads / Block 

(a) GTX 260 




-*-7K 
-»-6K 
-»-5K 
-•-4K 
^3K 
2K 
— IK 



Threads / Block 

(b) GTX 480 

Fig. 6.2: Runtimes for different thread counts per block. The input data size varies 
from IK to 6K. The best runtime for each input size is displayed in a box. 



entire range of input sizes is tested with all variations of segment lengths. The best 
times are then selected and used for calculating the runtime improvement relative to 
the CPU implementation. Efficiency and GFLOPS for each graphics card are then 
computed. 

The improvement factor from the GPU version is calculated against the reference 
Fortran MPI code running on the CPU. For AMD, the time duration obtained by 
running the program with 1 and 4 processes is used. The Intel i7-960 is equipped with 
Hyperthreading, meaning it can run 8 threads on just 4 physical cores. However, it was 
found that, in some cases, the 4 threads (MPI processes) version is faster. Therefore, 
one process and the best out of 4 and 8 processes is used as reference. The final 
runtime improvement factor for each input size is obtained by dividing the best time 
for each CPU by the best time of the GPU. When single-core is used as reference, the 
time measured while running the algorithm with just one process is divided by the 
best time of the GPU. 

Figures 6.3a and 6.3b show the runtime improvement for the two platforms used 
for testing (the latest generation GTX 480 and the older GTX 260) while using the 
entire range of inputs. We observe how larger inputs result in a higher improvement 
factor. Values rise sharply before starting to level at 4K (GTX 260) or 5K (GTX 480). 
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Fig. 6.3: Improvement factor obtained by A m calculation of alm2map with CUDA 
with respect to the MPI version ran on the AMD Phcnom and Intel i7 CPUs 



The graphs plot the values for input sizes that normally fit the cards used for testing 
as well as those that require output disabling. They are separated by a vertical line 
(normal inputs on the left). 

The AMD Phenom is slower than the Intel i7, therefore the improvement factor 
obtained will naturally be higher. When comparing the GTX 480 runtimes to those 
of single core CPU code, the performance ratio levels out at 60x for the Phenom 
and at 42x for the i7. For the older generation GTX 260, the factor is 3-3.5 times 
lower, at 17x and 12x, respectively. However, the relevant values are those obtained 
when using the CPUs to their full potential, with all their cores. The algorithm scales 
almost perfectly with the number of physical cores, the improvement values being 
generally one fourth of single core, with 14x and lOx for GTX 480 and 4x and 
3x for GTX 260. Intel Hyperthreading does not seem able to provide advantages by 
pushing scaling beyond the number of physical cores. 

Figure 6.4 displays the variation of efficiency with respect to the size of the input. 
The aspect of the curve is similar to that of the speedup graph, rising sharply before 
levelling at a 3K or 5K input. Being a memory-starved algorithm on the CPU, an 
adaptation to a faster chip, with a high-latency memory, was bound to suffer of the 

17 




same problem. Running S 2 hat on the Intel i7-960, it reaches an efficiency of just 
10%. As it can be seen in the graph, for the GTX 260, it mantains roughly the same 
value of 10%. On the GTX 480, however, it improves by a factor of 3, peaking at 
30%. Since on the commercial GTX 480 double precision is limited to a quarter of 
its performance, running the algorithm on a Tesla card (which does not have it) will 
probably decrease efficiency by exposing the memory latency issue, hidden by this 
limitation, but should significantly improve performance. 

In figures 6.5a and 6.5b we display the GFLOPS values obtained for each input 
size, for different counts of threads per block. Floating point operations methodology 
is as follows: additions and multiplications are computed as one operation each; due 
to lack of good references, divisions, square roots and logarithms arc each counted as 
20 operations. Rudimentary testing shows this value (20) to be an adequate estimate 
and, in either case, these operations combined are just 0.55% of the total number of 
floating point operations counted. Since GPUs do not yet have hardware counters 
for floating point operations, the operations were counted manually (adding values to 
variables in each thread followed by summation for the final results). 

As resulting from figures 6.5a and 6.5b, the number of resulting GFLOPS increases 
with the size of the input, with values leveling at 50-52 GFLOPS for GTX 480 (figure 
6.5b) and 14 for GTX 260 (figure 6.5a) 

6.3. Overall performance. The performance of the alm2map algorithm is greatly 
improved by offloading the A m computation onto a GPU. However, the second step, 
FFT calculation, requires attention also. In the original CPU-only code, the FFTs 
occupy 5-10% of the total runtime. Improving A m timing by a factor of 10 (Intel 
17-960, 4 processes), results in the FFTs becoming dominant. 

Since different FFT packages have different runtime characteristics, two CPU- 
only FFT libraries have been tested: one as implemented in Healpix [6] and the other 
- FFTW 6 [5]. Also, an FFT library for the NVIDIA GPUs, CUFFT [9], is employed. 
The current alm2map FFT implementation is a direct port of the original CPU version, 
with no specific GPU optimizations. 

Figure 6.6 shows the overall (A m + FFT) runtimes for all combinations of 



FFTW: http://www.fftw.org/ 
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A m computing code (Fortran on Intel i7-960 or CUDA on NVIDIA GTX 480), CPU 
FFT packages (Healpix or FFTW) and process count (1 or 4). Also, the runtime for 
a full GPU computation is plotted. Only the Intel i7-960 with NVIDIA GTX 480 
results are shown. 

We notice that, relative to the FFTW version, the Healpix package performs 
better for both 1 and 4 processes. We also observe that the best runtimes belong to 
the code running on the GPU. 

Figure 6.7 plots the overall runtime improvement over the CPU code versions with 
respect to the best performing GPU code (labeled "GTX480 A m + CUFFT" in figure 
6.6). We observe that, in the best case, the improvement is just half that obtained 
when considering just the A m computation (figure 6.3b), but also significant, reaching 
factors from 5 to 18 (when comparing to the best and worst, respectively, performing 
code). 

7. Conclusions and Future Work. This paper describes an algorithm for 
computing the inverse spherical harmonic transform on CPUs. The algorithm is com- 
pared with the implementation of the inverse spherical harmonic transform provided 
in S 2 HAT library, implemented using Fortran and MPI. The GPU algorithm leads to 
an improvement of up to a factor of 18 with respect to S 2 HAT on a single core and up 
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Fig. 6.6: alm2map overall runtime, Intel i7-960 and NVIDIA GTX 480 

to a factor of 5.5 with respect to S 2 hat on 4 cores of an Intel i7-960 machine. The 
improvement is limited by the performance of Fast Fourier transforms. 

Even though a single GPU offers high computing power, employing several is the 
easiest way of scaling performance as well as handling larger inputs. The algorithm 
has been designed with multi-GPU use in mind and it can directly fit into, and 
benefit from, the S 2 HAT MPI structure enabling straightforwardly distributed GPU 
computing. However special care has to be taken to ensure a good load balance among 
processors. This is the object of our current work. 

Acknowledgments. This work has been supported in part by French National 
Research Agency (ANR) through COSINUS program (project MIDAS no. ANR-09- 
COSI-009). 

REFERENCES 

[1] G. B. Arfken and H. J. Weber. Mathematical methods for physicists 6th ed. Academic Press, 
2005. 

[2] C. L. Bennett, M. Bay, M. Halpern, G. Hinshaw, C. Jackson, N. Jarosik, A. Kogut, M. Limon, 

20 




Fig. 6.7: alm2map overall performance improvement 



S. S. Meyer, L. Page, D. N. Spergel, G. S. Tucker, D. T. Wilkinson, E. Wollack, and E. L. 
Wright. The Microwave Anisotropy Probe Mission. ApJ, 583:1-23, January 2003. 
[3] A. G. Doroshkevich, P. D. Naselsky, O. V. Verkhodanov, D. I. Novikov, V. I. Turchaninov, 
I. D. Novikov, P. R. Christensen, and L. -. Chiang. First Release of Gauss-Legendre Sky 
Pixelization (GLESP) software package for CMB analysis. ArXiv Astrophysics e-prints, 
January 2005. 

[4] J. R. Driscoll and D. M. Healy. Computing fourier transforms and convolutions on the 2-sphere. 

Advances in Applied Mathematics, 15(2):202 - 250, 1994. 
[5] M. Frigo and S.G. Johnson. The design and implementation of FFTW3. Proceedings of the 

IEEE, 93(2):216-231, 2005. 
[6] K. M. Gorski, E. Hivon, A. J. Banday, B. D. Wandelt, F. K. Hansen, M. Reinecke, and 

M. Bartelmann. HEALPix: A Framework for High-Resolution Discretization and Fast 

Analysis of Data Distributed on the Sphere. ApJ, 622:759-771, April 2005. 
[7] P. F. Muciaccia, P. Natoli, and N. Vittorio. Fast Spherical Harmonic Analysis: A Quick 

Algorithm for Generating and/or Inverting Full-Sky, High- Resolution Cosmic Microwave 

Background Anisotropy Maps. ApJ, 488:L63+, October 1997. 
[8] Akira Nukada and Satoshi Matsuoka. Auto-tuning 3-D FFT library for CUDA CPUs. In SC 

'09: Proceedings of the Conference on High Performance Computing Networking, Storage 

and Analysis, pages 1-10, New York, NY, USA, 2009. ACM. 
[9] Nvidia. CUDA CUFFT Library, 2010. 
[10] Nvidia. NVIDIA CUDA Best Practices Guide, 2010. 
[11] Nvidia. NVIDIA CUDA Programming Guide, 2010. 
[12] Nvidia. Tuning CUDA Applications for Fermi, 2010. 

[13] G. F. Smoot, C. L. Bennett, A. Kogut, E. L. Wright, J. Aymon, N. W. Boggess, E. S. Cheng, 
G. de Amici, S. Gulkis, M. G. Hauser, G. Hinshaw, P. D. Jackson, M. Janssen, E. Kaita, 
T. Kelsall, P. Keegstra, C. Lineweaver, K. Loewenstein, P. Lubin, J. Mather, S. S. Meyer, 
S. H. Moseley, T. Murdock, L. Rokke, R. F. Silverberg, L. Tenorio, R. Weiss, and D. T. 
Wilkinson. Structure in the COBE differential microwave radiometer first-year maps. ApJ, 
396T1-L5, September 1992. 

[14] Mark Tygert. Fast algorithms for spherical harmonic expansions, ii. Journal of Computational 
Physics, 227(8):4260 - 4279, 2008. 

[15] V. Volkov and J. W. Demmel. Benchmarking CPUs to tune dense linear algebra. ACM/IEEE 
Conference on Supercomputing (SC08), 2008. 

[16] V. Volkov and J.W. Demmel. LU, QR and Cholesky factorizations using vector capabilities of 
GPUs. Technical Report UCB/EECS-2008-49, EECS Department, University of California, 
Berkeley, May 2008. 



21 



